Abstract

The term 'sake yeast' is generally used to indicate the Saccharomyces cerevisiae strains that possess
characteristics distinct from others, including the laboratory strain S288C, and are well suited for sake
brewing. Here, we report the draft whole-genome shotgun sequence of a commonly used diploid sake
yeast strain, Kyokai no. 7 (K7). The assembled sequence of K7 was nearly identical to that of
S288C, except for several subtelomeric polymorphisms and two large inversions in K7. A survey of
heterozygous bases between the homologous chromosomes revealed a mosaic-like, uneven
distribution of heterozygosity in K7. The distribution patterns appeared to have resulted from repeated
losses of heterozygosity in the ancestral lineage of K7. Analysis of genes revealed the presence of both
K7-acquired and K7-lost genes, in addition to numerous others with segmentations and terminal
discrepancies in comparison with those of S288C. The distribution of Ty elements also differed largely
between the two strains. Interestingly, two regions in chromosomes I and VII of S288C have apparently
been replaced by Ty elements in K7. Sequence comparisons suggest that these gene conversions were
caused by cDNA-mediated recombination of Ty elements. The present study advances our understanding
of the functional and evolutionary genomics of the sake yeast.

Key words: Saccharomyces cerevisiae; sake yeast; genome sequence; diploid; loss of heterozygosity 



1. Introduction 

Sake is a traditional Japanese alcoholic beverage 
that is fermented from steamed rice by the concerted 
action of two types of microorganisms, filamentous 
fungi and yeast. In the production of sake, enzymes 
secreted by the fungus Aspergillus oryzae, which is 
grown on steamed rice, convert rice starch into 
glucose. Yeast cells in the sake mash then produce 
ethanol, higher alcohols and their esters, organic 
acids and amino acids, which are important com- 
ponents that contribute to sake aroma and taste. 
Hence, the choice of a yeast strain is one of the 
most critical factors in determining the resulting 
aroma and taste characteristics of sake products. 
Yeast strains that were originally isolated in sake brew- 
eries have been identified as Saccharomyces cerevisiae 
and are now commercially distributed as sake yeast. 1 

Phylogenetic studies conducted using DNA markers 
have indicated that sake yeast strains are closely 
related, forming a sake strain cluster that belongs to a 
lineage distinct from other industrial and laboratory 
strains in the phylogenetic tree of S. cerevisiae. 2-5
Comprehensive genome-wide studies of diverse 
S. cerevisiae strains have also indicated the existence of 
a unique sake cluster that is distinct from wine and lab- 
oratory strains. 6,7 Consistent with their unique phyloge- 
netic position, sake yeast strains possess characteristic 
traits that differ from other S. cerevisiae strains and 
are ideal for sake brewing, including high
ethanol productivity (reaching 20%) 1,8,9 and efficient
growth and fermentation at low temperatures (below
15°C). In addition, nearly all sake yeast strains generate
a foam on the mash surface during the brewing process,
which results from yeast cells fermenting sugars
into CO2 bubbles, 10 and possess biotin biosynthetic
ability. 11

Several sake yeast-specific genes that are responsible 
for the desirable features of sake yeast and affect the 
brewing process have been identified. For example, 
AWA1, which encodes a cell-surface hydrophobic
protein with a GPI anchor, was identified from a sake
yeast genome and is responsible for foam formation
in sake mash. 10 Many S. cerevisiae strains are biotin
auxotrophs due to the lack of certain biotin biosynthetic
pathway genes. BIO6, a homolog of a bacterial
biotin biosynthetic pathway gene, was identified in
sake yeast strains and is essential in them for the
production of biotin. 11 In addition to these studies,
several genes involved in quantitative phenotypes of yeast,
including fermentability and aroma production,
have been suggested. 12,13 However, because the
genetic basis for the superiority of sake yeast in sake
brewing is largely unknown, genome-wide genetic
approaches are required to understand these complex
traits. Since the first complete genome sequencing of
strain S288C in 1996, 14 the genomes of several other
S. cerevisiae strains have been sequenced. 15-18 Strains
such as K11, Y9 and Y12 were also subjected to
genome sequencing, and the sake cluster was consequently
proposed. However, in each case, the total length
of sequence reads reached <0.9-fold of the haploid
genome, which is not enough to obtain a whole picture of the
genome. 6 In addition, these sequenced strains are not
typical industrial sake yeast strains. Accordingly,
yeast strains used in sake brewing have not been
subjected to whole-genome analyses, despite their
importance in industry and yeast phylogenetic systematics.

In the present study, we performed the whole- 
genome sequencing of the sake yeast strain K7 
(kyokai is the Japanese word for 'society') as the first 
step for subsequent functional, phylogenetic and evol- 
utionary genomic studies. K7 has been one of the 
most extensively used industrial sake yeast strains 
over the past several decades and has also been 
employed in numerous genetic and biochemical 
studies as a model sake yeast and a parent strain for 
breeding. 8,10-13,19-24 Here, we report the overall
chromosome structure and remarkable features of 
the K7 genome. 

2. Materials and methods 

2.1. Strain 

Saccharomyces cerevisiae K7, as distributed to
Japanese sake breweries by the Brewing Society of
Japan in 2004, was used as the DNA donor.

2.2. Sequencing, assembly and validation

The nucleotide sequence of the K7 genome was determined
using the whole-genome shotgun sequencing
approach. Genomic DNA was isolated from cells according
to the method described by Hereford et al. 25 Plasmid
libraries with average insert sizes of 1.6 and 5.0 kb were
constructed in pUC118 (Takara Bio Inc.), while a fosmid
library with an average insert size of 35 kb was
constructed in pCC1FOS (Epicentre Biotechnologies), as
described previously. 26 Raw sequence reads corresponding
to a 9.1-fold coverage of the haploid genome
(42 842, 88 830 and 15 227 reads from libraries with
1.6, 5.0 and 35 kb inserts, respectively) were first
obtained by sequencing from both ends of the inserts
on an ABI 3730xl DNA Analyzer (Applied Biosystems).
Sequence reads were trimmed at a threshold quality
value (Phred) of 20 and assembled using the Phrap
assembler. 27,28 We obtained a total of 712 contigs
that were then put in order based on paired-end
information from the constructed fosmid library to obtain
supercontigs. The overall assembly was then validated
and refined using Optical Mapping (OpGen, Inc.). A
number of short contigs were incorporated into
supercontigs with the assistance of optical maps and
transposon-mediated random sequencing from fosmid clones
(2112 reads). Following this analysis, the final contig
number was reduced to 706.
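
The read counts quoted above can be tallied with a short script. This is only a bookkeeping sketch: the variable and function names are illustrative, and the observation that the total including the transposon-mediated reads rounds to the 1.49 x 10^5 reads quoted in Section 3.1 is a consistency check, not a statement of how that figure was actually derived.

```python
# Read counts per library as stated in Section 2.2 (insert size in kb -> reads).
shotgun_reads = {1.6: 42_842, 5.0: 88_830, 35.0: 15_227}
# Transposon-mediated random sequencing reads from fosmid clones.
transposon_reads = 2_112


def total_reads(include_transposon: bool) -> int:
    """Sum the paired-end shotgun reads, optionally adding the transposon reads."""
    total = sum(shotgun_reads.values())
    if include_transposon:
        total += transposon_reads
    return total


print(total_reads(False))  # 146899
print(total_reads(True))   # 149011
```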

2.3. Sequence comparison 

An overall comparison of the K7 chromosomes
with S288C (NC_001133-NC_001148), EC1118
(FN393058-FN393060, FN393062-FN393087,
FN394216 and FN394217) and YJM789
(AAFW02000000) strain chromosomes was performed
using MUMmer 3.0 software. 29 Similarity-based
searches of individual genes were performed using
BLAST 30 and BLAST2. 31 Phylogenetic analyses were
carried out using CLUSTALW 1.83. 32

2.4. Extraction of heterozygous positions 

To determine the positions of heterozygosity in the 
K7 genome, nucleotide positions containing one or 
more bases different from the consensus base with 
the Phred quality values >20 were automatically 
extracted as candidates for heterozygous positions. 
The identified positions were then manually checked
on the electropherograms to validate the sequence.
Non-supercontig contigs, which failed to be assembled 
into supercontigs, were not included in the analysis 
because their chromosomal positions were unknown. 
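
The automatic extraction step can be illustrated with a minimal sketch. The pileup data structure, the function name and the use of the most frequent base as the consensus are assumptions made for illustration; the text does not describe the actual software, and the manual electropherogram check is not modelled here.

```python
from collections import Counter

PHRED_MIN = 20  # quality threshold from the text (values > 20 count)


def candidate_het_positions(pileup):
    """Return positions where a confident read base disagrees with the consensus.

    pileup maps a position to a list of (base, phred_quality) read calls.
    """
    candidates = []
    for position in sorted(pileup):
        calls = pileup[position]
        if not calls:
            continue
        # Assumption: the consensus is the most frequent base at the position.
        consensus = Counter(base for base, _ in calls).most_common(1)[0][0]
        if any(base != consensus and qual > PHRED_MIN for base, qual in calls):
            candidates.append(position)
    return candidates


# Hypothetical toy pileup: only position 10 has a high-quality minority base.
toy_pileup = {
    10: [("A", 40), ("A", 35), ("G", 30)],
    11: [("C", 40), ("C", 38), ("T", 12)],
    12: [("T", 30), ("T", 30)],
}
print(candidate_het_positions(toy_pileup))  # [10]
```

Position 11 is not reported because its only discordant base falls below the quality threshold.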

2.5. Gene prediction and annotation 

For predicting protein-encoding genes, ORFs larger
than 90 bp were comprehensively included as candidates.
ORF prediction was then carried out based on a
direct comparison of S288C ORFs with the K7
genome supercontigs. When direct comparison was
difficult, ORFs were predicted using the software
programs CRITICA, 33 Glimmer2, 34 GlimmerHMM 35 and
SIM4. 36 Finally, all K7 ORFs were manually validated
by expert annotators. When one or more incomplete
ORFs, such as those truncated by a sequence gap and
lacking a start or a stop codon, were mapped to a
single S288C ORF, each incomplete K7 ORF was
annotated as a single ORF. Functional annotation was based
primarily on the Saccharomyces Genome Database
(SGD; http://www.yeastgenome.org/), secondarily on
the Saccharomyces species database (yeast comparative
genomics: http://www.broadinstitute.org/annotation/
fungi/comp_yeasts/) and also on COG/KOG
(http://www.ncbi.nlm.nih.gov/COG/) and DDBJ/EMBL/
GenBank non-redundant databases. Orthology with
the S288C ORF was evaluated using the BLASTP
similarity, calculated as the percentage of matched
amino acid residues relative to the total covered region
between a K7 ORF and the best-hit S288C ORF
(Supplementary Table S4). For ORFs truncated by a sequence
gap, similarity was calculated from the number of matching
residues in only the corresponding regions of the
S288C ORF. Dubious ORFs, ORFs in Ty elements and
ORFs in telomeric regions were excluded as possible
protein-coding genes and were not annotated.
Prediction and annotation of RNA genes, Ty elements
including solo long terminal repeats (LTRs) and telomeric
elements were manually performed based on the results
of BLASTN searches of the K7 genome with the S288C
sequences of these genes and elements as queries.

All annotated ORFs and genetic elements were given
individual numbers (Supplementary Table S4).
Nomenclature of the K7 genes was based on the following
rules: (i) each protein-encoding or RNA gene was
named according to the orthologous S288C gene
using the format 'K7_' plus the S288C standard gene
name (with >80% similarity) and the systematic
name (with >50% similarity) given in SGD; (ii) K7
identification numbers or K7 original gene names,
such as AWA1, were given to genes that were
non-orthologous or of low similarity to S288C genes (with <50%
similarity); (iii) each name of a gene truncated by a
sequence gap or segmented by point mutations was
followed by a lower case 'a', 'b' or 'c', such as 'XXX1a'
and 'XXX1b', to show its correspondence to a partial
region of the ortholog; and (iv) Ty elements and LTRs
were independently termed according to the identical
nomenclature used for S288C.
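
The similarity measure and naming rules (i) to (iii) can be sketched as follows. This is one possible reading of the rules, not the annotators' actual procedure: it assumes the standard name is used above 80% similarity and the systematic name between 50% and 80%, and the function names, the placeholder gene 'XXX1', the systematic name 'YXX000W' and the identifier 'K07_00000' are all hypothetical.

```python
def percent_similarity(matched_residues: int, covered_residues: int) -> float:
    """Percentage of matched residues over the covered region of the alignment."""
    if covered_residues <= 0:
        raise ValueError("covered region must contain at least one residue")
    return 100.0 * matched_residues / covered_residues


def k7_gene_name(similarity, standard_name, systematic_name, k7_id, fragment=None):
    """Apply naming rules (i)-(iii) under the assumptions stated in the lead-in.

    fragment is None for an intact ORF, or 0, 1, 2 for segments 'a', 'b', 'c'.
    """
    if similarity > 80 and standard_name:
        name = f"K7_{standard_name}"      # rule (i), standard gene name
    elif similarity > 50 and systematic_name:
        name = f"K7_{systematic_name}"    # rule (i), systematic name
    else:
        name = k7_id                      # rule (ii), K7 identification number
    if fragment is not None:
        name += "abc"[fragment]           # rule (iii), segment suffix
    return name


similarity = percent_similarity(287, 300)  # hypothetical alignment counts
print(round(similarity, 1))                                                   # 95.7
print(k7_gene_name(similarity, "XXX1", "YXX000W", "K07_00000"))               # K7_XXX1
print(k7_gene_name(similarity, "XXX1", "YXX000W", "K07_00000", fragment=1))   # K7_XXX1b
```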

3. Results and discussion 

3.1. Sequencing and assembly of the K7 genome

Genome sequencing of the diploid sake yeast strain
K7 was performed by a whole-genome shotgun
method that yielded 1.49 x 10^5 sequence reads with
an estimated 9.24-fold redundancy of the haploid
genome. Following de novo sequence assembly, 17
supercontigs were generated. Since homologous
chromosome pairs of K7 were almost indistinguishable
from each other during the assembly process,
nearly all the reads from each homologous chromosome
pair were assembled together into a single
supercontig. Consequently, the resulting consensus
chromosomal sequences represented a diploid
genome, although they seemed to be that of a
haploid genome. The total length of the supercontigs
corresponded to 98.1% of the estimated K7 genome
size. The sequencing results and the assembled
supercontigs are summarized in Supplementary Tables S1
and S2. Comparison with the S288C genome revealed
that the 17 supercontigs corresponded to the set of
S288C chromosomes and mitochondrial DNA
(Fig. 1; Supplementary Table S2). No DNA sequence
for the yeast 2-µm plasmid was detected in any
reads, which was consistent with a previous study. 37
In the assembly process, only 200 of 706 contigs were
used for generating the supercontigs because the
remaining 506 contigs failed to align with any of the
supercontigs. The total length of these non-supercontig
contigs was ~604 kb. Of the 506 non-supercontig
contigs, 360 (71.1%) were singletons and 395
(78.1%) contained <1000 bases (Supplementary Fig.
S1A and B). The majority of these contigs seem to have been
excluded from the supercontigs due to the repetition
of telomeric and Ty-related sequences. The
remainder seemed to be excluded as a result of
considerable heterozygosity between the two homologous
chromosomes due to base substitution, in-del, Ty
insertion or other such events (Supplementary Fig. S1C and
D). Several non-supercontig contigs contained ORF-like
sequences that were not found in the supercontigs.
For example, the nucleotide sequence of MATa was
found in a non-supercontig contig due to its
considerable heterozygosity, while the supercontig
corresponding to the K7 chromosome III only contained the
MATalpha sequence, even though K7 possesses both
mating-type loci, MATalpha and MATa. Contigs including
the sequence corresponding to VTH1/VTH2, which are
paralogous and nearly indistinguishable from each
other, were not assembled in the supercontigs,
presumably due to adjacent repetitive sequences. Similarly, the
sequence for ARR3 was also present only in the
non-supercontig contigs (data not shown). The
non-supercontig contigs were excluded from subsequent analyses
such as comparison with related strains, gene prediction
and the survey of heterozygosity and Ty elements.
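
The contig bookkeeping in this section can be re-derived from the quoted counts. The sketch below only re-computes the numbers stated in the text; the variable names are illustrative.

```python
total_contigs = 706         # final contig number after refinement
supercontig_contigs = 200   # contigs used to build the 17 supercontigs
non_supercontig = total_contigs - supercontig_contigs

singletons = 360            # non-supercontig contigs made of a single read
short_contigs = 395         # non-supercontig contigs with <1000 bases


def percent(part: int, whole: int) -> float:
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(non_supercontig)                          # 506
print(percent(singletons, non_supercontig))     # 71.1
print(percent(short_contigs, non_supercontig))  # 78.1
```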

To date, the genome sequences of several S. cerevisiae
strains have been reported; therefore, we compared the
K7 genome with available genomes. 14,15,17 Pairwise
nucleotide polymorphisms among four strains (K7,
S288C, YJM789 and EC1118) were analyzed by
sequence alignment using MUMmer 3.0 software. 29
The number of substitutions and small indels
between K7 and the three other strains ranged from
~67 900 (5.6/kb) to 78 000 (6.5/kb) and ~19 300
(1.6/kb) to 23 500 (2.0/kb), respectively, while
those among the remaining three non-K7 strains
ranged from ~46 100 (3.8/kb) to 56 700 (4.7/kb)
and ~14 800 (1.2/kb) to 16 700 (1.4/kb),
respectively (Supplementary Fig. S2). These results indicate
that the phylogenetic position of K7 is relatively distant
from that of S288C, YJM789 and EC1118, as expected
from previous studies. 2,3,5-7,17

Figure 1. Dot-plot alignments of homologous chromosomes of S. cerevisiae strains K7 and
S288C. The nucleotide sequence of each K7 supercontig was compared with that of the
homologous S288C chromosome (indicated above each alignment) using MUMmer 3.0.
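
As a rough consistency check on the figures quoted above, each count can be divided by its per-kb density to give the aligned length it implies. This sketch assumes that density is simply count divided by aligned length in kb, which the text does not state explicitly; the counts are approximate, so only the order of magnitude is meaningful.

```python
# (count, density per kb) pairs quoted in the text; all values are approximate.
reported = {
    "K7 vs others, substitutions, low": (67_900, 5.6),
    "K7 vs others, substitutions, high": (78_000, 6.5),
    "K7 vs others, indels, low": (19_300, 1.6),
    "K7 vs others, indels, high": (23_500, 2.0),
    "non-K7 pairs, substitutions, low": (46_100, 3.8),
    "non-K7 pairs, substitutions, high": (56_700, 4.7),
    "non-K7 pairs, indels, low": (14_800, 1.2),
    "non-K7 pairs, indels, high": (16_700, 1.4),
}


def implied_length_mb(count: int, per_kb: float) -> float:
    """Aligned length in Mb implied by a count and its per-kb density."""
    return count / per_kb / 1000.0


for label, (count, per_kb) in reported.items():
    print(f"{label}: {implied_length_mb(count, per_kb):.2f} Mb")
```

Under that assumption every pair implies an aligned length of roughly 12 Mb, so the counts and densities are mutually consistent.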

3.2. Inversions in chromosomes V and XIV 

We compared chromosomal structures between K7
and S288C using MUMmer 3.0 software, 29 as shown
in Fig. 1. Although the overall genome structure of
K7 closely resembled that of S288C, we identified
two types of chromosomal rearrangements. One
type involved several complicated subtelomeric
rearrangements that were also observed or suggested
by previous genome-wide studies of various S. cerevisiae
strains, indicating that such rearrangements were not
infrequent events. 14,17,18,38-40 The other type of
rearrangement was characterized by two large
internal inverted regions. We confirmed that these
inversions were homozygous by PCR analysis (data
not shown). One inversion of the ~100-kb region
on the right arm of chromosome V had not previously
been described. Both boundary regions of this
inversion on the K7 chromosome V were flanked by two
Ty2 elements (K7_YERCTy2-3 and K7_YERWTy2-4)
that were inverted in relation to each other and
were absent in the corresponding region of S288C.
Therefore, these Ty2 elements have been proposed
to mediate this reciprocal inversion (Supplementary
Fig. S3). 15 Since Ty insertions differ by strain or
lineage, this inversion would be unique to K7 or
related lineages (Supplementary Fig. S3). The second
identified inversion was a ~30-kb region on the left
arm of chromosome XIV. This inversion was also
observed in other strains. 15,17 Inverted homologous
regions located close to the breaking points on
chromosome XIV (YNL018C-YNL019C and
YNL033W-YNL034W) may have mediated this
reciprocal inversion.

3.3. Uneven distribution of heterozygosity 

The sequence obtained for the K7 genome represents
the consensus haploid sequence derived
from the two homologous chromosomes, although
K7 is a diploid. Accordingly, we surveyed the positions
of heterozygosity by carefully examining the sequence
reads. A total of 1347 heterozygous sites between the
homologous chromosomes were detected, and their
positions were subsequently mapped (Fig. 2).
Interestingly, the windows containing multiple
heterozygous sites were unevenly distributed as clusters
in several chromosomal regions.
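
The per-window tally behind such a plot can be sketched as below, using the 10 kb window size given in the Figure 2 legend. The function name, the 1-based coordinate convention and the example positions are assumptions for illustration only; they are not taken from the K7 data.

```python
from collections import Counter

WINDOW_BP = 10_000  # 10 kb windows, as in the Figure 2 legend


def window_counts(positions, window=WINDOW_BP):
    """Count heterozygous sites per window on one chromosome.

    positions are 1-based coordinates; window 0 covers bases 1..window.
    Windows without any site are simply absent from the result.
    """
    counts = Counter((pos - 1) // window for pos in positions)
    return dict(sorted(counts.items()))


# Hypothetical positions: two clusters near the start and one isolated site.
example_sites = [120, 9_800, 10_001, 10_450, 35_000]
print(window_counts(example_sites))  # {0: 2, 1: 2, 3: 1}
```

Windows with several sites would correspond to the clustered regions described in the text, while windows with a single site would correspond to isolated heterozygosities.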

Simple evolutionary accumulation of point
mutations is insufficient to explain such uneven
distribution of heterozygosity. 41 It is more likely that a
heterozygous diploid was generated by the out-crossing
of two different haploid strains and subsequent loss
of heterozygosity (LOH), resulting in the observed
pattern. Alternatively, a complex history of out- and/
or back-crossings of the ancestral strain in natural
environments could have also caused this pattern.
However, repeated backcrossing is unlikely to have
occurred in nature since the sporulation efficiency of
sake yeast, including K7, is markedly low. Thus, we
speculate that sequential LOH events have resulted
in the uneven distribution of heterozygosity in K7.
Conversely, it is reasonable to presume that isolated
heterozygosities were introduced by point mutations
independent of LOH events. Genome-wide LOH was
also proposed for the diploid strain YJM128. 41
However, LOH in K7 was far more extensive than
that found in YJM128, resulting in 82.7% of the
entire genome being almost homozygous. As LOH is
mainly caused by mitotic recombination, the
probability of LOH events is dependent on the number of
clonal generations during asexual proliferation.
Therefore, it is likely that K7 passed through more
mitotic generations than YJM128 after the
out-crossing event. A sporulation defect of K7 may have
contributed to a long-term clonal proliferation that
allowed extensive LOH events without meiosis. LOH
events also result in the selection of one haplotype
between two homologous chromosomes whose
haplotypes differ from each other. Consequently, LOH
can be a major driving force in the diversification
and microevolution of diploid strains, such as K7,
which lost a meiotic life cycle, as observed in
Candida albicans. 42

Figure 2. Genome-wide distribution of heterozygosity between
homologous chromosomes of K7. Heterozygous sites were
identified by manually checking sequence reads assembled
within the supercontigs. The frequency of the extracted
heterozygosity in homologous K7 chromosomes was plotted
for each 10 kb window of the chromosomal coordinates. x-axis,
chromosomal coordinates; y-axis, heterozygosity counts; arrow
heads, position of centromeres.



3.4. Gene predictions 

Following the sequencing and assembly of the K7
genome, we predicted and annotated 5815 ORFs on
16 nuclear chromosomes and mitochondrial DNA
(Supplementary Table S4). We observed many
incomplete ORFs interrupted by a sequence gap between
contigs: 39 ORFs had a truncated terminus at one
end and 124 ORFs (62 pairs) had internal gaps.
When compared with the S288C genome, frame
shifts caused by small indels and single-nucleotide
changes at the start or the stop codons resulted in
many ORF polymorphisms, including terminal
disagreement, such as extension or truncation (132
genes), segmentation of a single S288C ORF into
multiple K7 ORFs (89 genes corresponding to 43
orthologs in S288C) and the fusion of ORFs (13 genes
corresponding to 26 orthologs in S288C;
Supplementary Table S4). The influence of these
polymorphisms on each respective gene function is
unclear and remains to be elucidated, although for
several cases, such as the K7 ortholog of MSN4,
polymorphism appears to influence the characteristic
features of K7. 43

The average BLAST similarity between K7 ORFs and
the most similar S288C ORFs was greater than 95%
(95.7 and 95.8% at the nucleotide and amino acid
levels, respectively). More than 90% of the K7 ORFs
displayed a similarity exceeding 97% with the
corresponding S288C ortholog at the amino acid level
(Fig. 3). Genes for tRNAs and other non-coding RNA
were also annotated (Supplementary Table S3),
which revealed that they corresponded to an almost
complete set of S288C RNA genes. However, K7 lost
three tRNA genes and one copy of two RUF5 ncRNA
genes (data not shown). Their absence would not be
expected to have an effect on the cellular functions
of K7 because tRNA genes corresponding to a specific
codon are highly redundant and one copy of RUF5
was still present in the K7 genome.

Figure 3. Identities of K7 ORFs with the top-hit S288C ORF. Each K7
ORF was compared with the corresponding orthologous S288C
ORF using BLASTN (nucleotide level) and BLASTP (amino acid
level). The ORF proportion based on the observed identities
was then plotted.



3.5. Differentially present genes between K7 and 
S288C 

The comparison of ORFs between K7 and S288C
genomes disclosed 97 differentially present genes:
48 ORFs unique to K7 and 49 ORFs unique to
S288C (Supplementary Tables S5 and S6). In this
analysis, we excluded differences in subtelomeric
multicopy gene families such as PAU, COS, DAN, SNO and
MAL. Many of the unique genes were located at
subtelomeric plastic regions of K7 and S288C, indicating
that they may have been acquired or lost by
chromosomal rearrangement events. Most differentially
present genes located in internal chromosomal
regions resulted from small mutations such as frame
shifts or gene duplication events in a single strain.

3.5.1. K7 genes absent in S288C

The genes predicted as present in K7 that are absent in S288C are
listed in Supplementary Table S5. Several K7 genes have
already been shown to be involved in the characteristic
features of sake yeast, including K7_AWA1
(K07_06182) and a paralogous set of K7_BIO6
genes (K7_BIO6-1/BIO6-2a/BIO6-2b/BIO6-3/BIO6-4a/BIO6-4b:
K07_11198/11203/11204/03384/11206/11207), which are
unique to sake yeast strains. 10,11 BIO1, which is absent
in S288C, is also required for biotin biosynthesis in
yeast. 44 As expected from the biotin prototrophy displayed
by K7, BIO1 orthologs were also found in K7
(K7_BIO1-1/BIO1-2/BIO1-3: K07_00624/03376/03381). 11,45

Three paralogous genes that are not found in S288C,
K07_00009/11194/04100 (named K7_EHL1/EHL2/
EHL3 in this study), were predicted to encode proteins
similar to bacterial epoxide hydrolase. Numerous
bacterial orthologs to K7_EHL1/EHL2/EHL3 have been
identified in the DDBJ/EMBL/GenBank non-redundant
database, whereas eukaryotic orthologs have only
been found in S. paradoxus to date (Supplementary Fig.
S4). Thus, these epoxide hydrolase genes may have
been horizontally transferred from bacteria to a
common ancestor of these yeasts. K7_EHL1/EHL2/
EHL3 are presumably involved in the detoxification of
harmful epoxide compounds. However, the actual
substrates are unknown, and to date, no epoxide
compounds have been identified from sake or fermenting
sake mash.

K7_02354, which was located in a subtelomeric
region, is orthologous to YJM-GNAT of YJM789,
which encodes a protein similar to bacterial GCN5-related
N-acetyltransferase, suggesting that a
common ancestor of K7 and YJM789 acquired their
ancestral gene by horizontal transfer. 15 Sequences of
K7_02354 and YJM-GNAT were almost identical
(over 99%) at both the nucleotide and the amino
acid levels (Supplementary Fig. S5).

K7_KHR1 (K07_03550), which encodes a previously
identified heat-resistant killer toxin, 45 was
located in an internal region of chromosome IX and
wedged between two solo LTRs. Although a similar
structure was observed in the EC1118 genome, 17
S288C does not possess KHR1, and only a solo LTR
(YILCdelta3) is located in the corresponding locus.
This suggests that the loss of KHR1 in S288C was
caused by LTR-mediated recombination, as predicted
in a previous study. 17 A large proportion of genes
unique to K7 (Supplementary Table S5) have not
been characterized, and their involvement and
function in the characteristic features of sake yeast
remain to be explored.

3.5.2. S288C genes absent in K7

Forty-nine genes predicted as present in S288C but absent in
K7 are listed in Supplementary Table S6. We confirmed
that these genes are not present in the K7
non-supercontig contigs. Notably, two subtelomeric paralogous
blocks in the S288C genome, containing
HXT15-SOR2-MPH2 on chromosome IV and
HXT16-SOR1-MPH3 on chromosome X, were not identified
in the K7 genome. The paralogous pairs of HXT15
and HXT16, SOR1 and SOR2 and MPH2 and MPH3
encode nearly identical hexose transporters, sorbitol
dehydrogenases and maltose transporters,
respectively. 46-48 It is likely that non-reciprocal
chromosomal recombinations in subtelomeric regions caused
duplication of these sequences in S288C, but resulted
in their loss in the K7 lineage. Another subtelomeric
gene, AIF1, which is located on S288C chromosome
XIV and encodes a mitochondrial cell death factor, 49
was also lost in K7. CWP1, encoding a cell-wall
protein linked to glucan chains, 50 was disrupted by
a frame-shift mutation in K7. Although the effects of
the loss of this protein are unclear, it is possible that
the cell-wall properties of K7 are affected.

PPT1, which encodes a protein phosphatase, 51 is
located at an internal region of chromosome VII in
S288C, whereas the corresponding 2.6-kb region
was lost and replaced with a Ty element
(K7_YGRCTy2-2) in the K7 genome (Fig. 4A). The
effect of the loss of PPT1 on cellular function or the
sake brewing character of K7 is unclear. A tRNA
gene tR(UCU)G3 on the left side of PPT1 was also
absent in K7. However, the tI(AAU)G gene located
on the right side of PPT1 in S288C was present in
K7. Sequences located several hundred bases
upstream of a tRNA gene can serve as potential
target sites for Ty integration. Consequently, multiple
Ty insertion and subsequent excision events may have
resulted in the two solo LTRs on both sides of PPT1 in a
K7 ancestral genome. Recombination, including
insertion or gene conversion, between a Ty cDNA and a yeast
chromosome with LTRs has been reported in laboratory
experiments. 52 Therefore, we speculate that gene
conversion between a double-stranded Ty cDNA and
two solo LTRs pre-existing on chromosome VII
may have occurred in the K7 ancestor, resulting in
the replacement of PPT1 with Ty2 (Fig. 4,
Supplementary Fig. S6). Alternatively, double
reciprocal crossing-over may have resulted in the observed
structure; however, this is unlikely, as PPT1 was not
found in either the supercontigs or the
non-supercontig contigs of K7. A similar structure was also observed
in the right arm of K7 chromosome I (Fig. 4B),
involving the replacement of two tRNA genes tL(CAA)A
and tS(AGA)A with Ty2, which we speculate may
have involved a similar gene conversion mechanism.
In previous studies, Ty cDNA-mediated gene
conversions were only observed in genetically modified
strains and were thought to be an artificially
induced phenomenon. 52 The structures identified in
the K7 genome would represent the first example of
spontaneous direct Ty-mediated gene conversion
in a wild-type strain.

Two tandemly duplicated acid phosphatase genes,
PHO3 and PHO5, are located on chromosome II of
the S288C genome. 53 However, K7 possessed only
PHO5 (K07_00381), consistent with the observation
that K7 shows repressible, but not constitutive, acid
phosphatase activity. 54 These tandemly arranged
acid phosphatase genes are also present in YJM789,
EC1118 and several other Saccharomyces
species. 15,17,55 The prevalence of these genes suggests
that PHO3 was looped out of the K7 genome by
homologous recombination. In the K7 genome, no copies
of ASP3 and its neighboring ORFs, which encode
cell-wall L-asparaginase II and putative proteins of
unknown function, were found, even though S288C
chromosome XII contains at least four copies of ASP3
adjacent to rDNA repeats. 56 These sequences are
also absent in numerous other S. cerevisiae
strains. 15-17,39 Although tandemly triplicated ENA1/
ENA2/ENA5, encoding P-type ATPases, are located on
chromosome IV in S288C, 57 only one copy was
identified in K7 (K7_ENA1/K7_01190), as observed in
many other strains. 15-17,39 The triplication is a
relatively specific feature of S288C and its related
strains. The K7 genome also contained one copy of
CUP1 (K7_CUP1/K07_03116), encoding a
metallothionein, whereas S288C contains two copies in
tandem, 58 consistent with the lower copper resistance
of K7 than that of X2180, an isogenic diploid of
S288C (data not shown).

Figure 4. Loss of genomic regions by Ty2 replacement in the K7 genome. Each region within the dotted-line box in the S288C genome was
replaced by a Ty element (Ty2) in the K7 genome. (A) A 2628-bp region containing the tRNA gene tR(UCU)G3 and PPT1/YGR123C in
S288C chromosome VII was replaced by K7_YGRCTy2-2 in K7 chromosome VII. (B) A 2464-bp region containing two tRNA genes,
tL(CAA)A and tS(AGA)A, in S288C chromosome I was replaced by K7_YARWTy2-1 in K7 chromosome I.

3.6. Hexose transporter and GAL genes 

Transport of sugars across the membrane is one of
the key steps in ethanol fermentation. The S288C
genome contains 20 genes that encode hexose
transporter family proteins (HXT1-HXT17 and GAL2) and
glucose sensors (SNF3 and RGT2), as shown in Fig. 5.
The products of these genes show different glucose
affinities and are differentially expressed in order to
coordinately control glucose uptake in environments
with a broad range of glucose concentrations. 59
Many of these genes, including two glucose sensors,
were highly conserved among S288C and K7. In
particular, both the low-affinity glucose transporter
genes HXT1/HXT3 were conserved, implying that
they may be responsible for glucose uptake during
the sake brewing process, as reported in wine
yeast. 60 HXT5/HXT6/HXT7 were located at contig
ends, and the DNA sequences of these regions in K7
were not completely analyzed. In S288C, HXT6 and
HXT7 encode nearly identical proteins and are
arranged in tandem; however, they may have been
combined into a chimeric gene (HXT6/7) in K7, as
observed in other strains. 59

We found that two distinct classes of S288C HXT
genes displayed largely altered gene structures in K7.
The first contains HXT9/HXT11 and HXT12, which is
annotated as a possible pseudogene due to a
frame-shift mutation in SGD. In the K7 genome, although one
of the duplicated HXT9 orthologs was conserved
(K7_HXT9-2/K07_02205: 98% identical to HXT9 at
the amino acid level), all other HXT9/HXT11 orthologs
were divided into two ORFs due to frame shifts
(K7_HXT9-1a/HXT9-1b/HXT11a/HXT11b: K07_03662/
03663/06179/06180). The disrupted structure of
HXT12 was also conserved in K7 (K7_HXT12a/HXT12b:
K07_03394/03395) as well as in S288C
(Supplementary Table S4). Four genes, HXT13/
HXT15/HXT16/HXT17, are classified into the second
HXT class. In the K7 genome, the HXT13/HXT17
orthologs (K7_HXT13a/HXT13b/HXT17a/HXT17b:
K07_01805/1807/06162/06164) contained
frame-shift mutations that may have resulted in the
loss of function (Supplementary Table S4), while the
sequences corresponding to HXT15/HXT16 were
absent in the K7 genome (Supplementary Table S6).
Thus, in K7, the functions of gene products in this
class appeared to be completely lost, although their
molecular functions were unclear.

Figure 5. Phylogenetic tree of hexose transporter family proteins
identified in S288C. The amino acid sequences of S288C HXT
gene products were clustered by CLUSTALW 1.83, and the
dendrogram was plotted using the TreeView program. Symbols
at the end of the branches indicate the structural class of K7
orthologous gene products, as shown in the box under the
dendrogram. Two groups of Hxts, whose functional structures
were disrupted in K7, are framed with rectangles. Underlines
indicate genes whose expression is repressed under
high-glucose conditions. The letter (a) denotes that each of the two K7
orthologs is indicated as a half circle.

In the K7 genome, we also identified frame-shift
mutations in the GAL3 and GAL4 genes that divided
each gene into two ORFs (K07_01156 and
K07_01157 for GAL3 and K07_06847 and
K07_06846 for GAL4). Thus, both Gal3p and Gal4p
in K7 are shorter at their C termini than their orthologs
in S288C, suggesting that their molecular functions
may be impaired. Since Gal3p and Gal4p
function as an inducer and activator, respectively,
which constitute the transcriptional induction
system of galactose-assimilating genes, 61 the loss of
Gal3p and Gal4p functions may lead to defective
GAL gene induction in response to galactose.
Consistent with this speculation, the assimilation
and fermentation of galactose are remarkably weakened
in several sake yeast strains, including K7. 62
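
As a toy illustration of how a frame shift can divide one gene into two
annotated ORFs, the following minimal sketch deletes a single base from an
invented 27-nt sequence and re-scans the three forward reading frames. The
sequence, the function name find_orfs and the min_codons threshold are
assumptions made for illustration only; they are not K7 data or the
annotation procedure used in this study.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs of ATG...stop ORFs on the forward strand.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open an ORF at the first ATG in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the ORF and look for the next ATG
    return sorted(orfs)


# Hypothetical wild-type gene: ATG GCA TTG ACC AAA ATG GGT CTC TAA
wild_type = "ATGGCATTGACCAAAATGGGTCTCTAA"

# Single-base deletion (the C of the second codon) shifts the reading frame.
mutant = wild_type[:4] + wild_type[5:]

wild_type_orfs = find_orfs(wild_type)  # one ORF spanning the whole gene
mutant_orfs = find_orfs(mutant)        # a truncated 5' ORF plus a 3' ORF
```

In this toy case the first mutant ORF ends at a premature stop codon, and the
second begins at an internal ATG that lies in the original codon phase and
runs to the original stop codon.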

Our analyses suggest that K7 possesses different
sugar uptake and assimilation properties from those
of S288C. It is likely that sake yeast cells could tolerate
the functional loss of these genes without negative
selection because these strains grow in
the glucose-rich environment specific to sake mash.

3.7. Structure of the HO gene

As K7 is a heterothallic diploid, HO, which is involved in
mating-type switching, was predicted to be non-functional
in K7. 63 Indeed, a homozygous mutation
of A1424T (H475L), which was reported to cause a
loss of function in S288C, 64 was observed in the K7
ho allele. Moreover, a 36-amino acid deletion at
524-559 was also present in K7, as was reported in
the heterothallic bioethanol strain JAY291. 18
Collectively, these differences are likely to be responsible
for the observed K7 heterothallism.
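
The coordinates quoted above can be checked with simple arithmetic. The
short sketch below maps nucleotide 1424 of the coding sequence to its codon
and confirms that an A-to-T change at that codon position converts either
histidine codon into a leucine codon under the standard genetic code. The
function names are hypothetical, and the actual codon used in HO is not stated
in the text, so both histidine codons are tested.

```python
def codon_of(nt_pos):
    """Map a 1-based CDS nucleotide position to (codon number, position in codon)."""
    return (nt_pos - 1) // 3 + 1, (nt_pos - 1) % 3 + 1


def substitute(codon, pos_in_codon, new_base):
    """Return the codon with the base at the 1-based position replaced."""
    return codon[:pos_in_codon - 1] + new_base + codon[pos_in_codon:]


# Only the standard-code entries needed for this check.
CODON_TABLE = {"CAT": "H", "CAC": "H", "CTT": "L", "CTC": "L"}

codon_number, pos_in_codon = codon_of(1424)

# Both histidine codons carry A at the second position; A->T gives leucine.
his_to_leu = {
    his: CODON_TABLE[substitute(his, pos_in_codon, "T")]
    for his in ("CAT", "CAC")
}

# Residues 524 to 559 inclusive.
deletion_length = 559 - 524 + 1
```

Nucleotide 1424 falls at the second position of codon 475, which agrees with
the H475L notation, and the interval 524-559 spans 36 residues.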

3.8. Comparison of Ty elements and LTRs

The chromosomal insertion of a Ty element can
not only cause the loss or alteration of gene function,
but may also modify gene expression levels. The annotated
Ty elements and solo LTRs from K7 are summarized
in Supplementary Tables S7 and S8. Nearly all Ty
and solo LTR insertions followed the target-site-selection
rule, displaying preferential insertion within a 1-kb
upstream region of RNA polymerase III-target genes
(Supplementary Table S8). We compared Ty insertion
events, which included intact Ty elements and solo
LTRs, between K7 and S288C. If the flanking regions of
two Ty elements, a Ty element and a solo LTR, or two
solo LTRs were identical between the two strains, we
inferred that these insertion sequences were
derived from the same Ty element insertion event in
the common ancestral strain. A total of 198 Ty insertion
events were estimated to be identical between
K7 and S288C, while 121 and 137 insertion events
were unique to K7 and S288C, respectively
(Supplementary Tables S7B and S8). Interestingly,
among the 198 shared insertion events, only one pair
maintained intact Ty structures in both strains:
K7_YCLWTy5-1 and YCLWTy5-1. In addition, two
pairs retained an intact Ty structure in only one of the strains:
YARCTy1-1 in S288C and K7_YERWTy3-1 in K7. All
other insertion events were observed as solo LTRs,
representing trace sequences of Ty elements
(Supplementary Table S8). This observation supports
the idea that the shared insertion events occurred in the
much more distant past than the unique insertion
events. In our analyses, we were unable to locate K7
genes that were interrupted by a Ty insertion.
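
The comparison rule described above (identical flanking regions imply a
shared ancestral insertion event) can be sketched as a set comparison. This
is a minimal illustration under stated assumptions, not the procedure used in
this study: the Insertion record, the flank_key field (a stand-in for whatever
representation of the flanking sequence is compared) and all toy entries are
hypothetical.

```python
from typing import NamedTuple


class Insertion(NamedTuple):
    name: str       # hypothetical element label
    flank_key: str  # hypothetical identifier of the flanking region
    intact: bool    # True for a full-length Ty, False for a solo LTR


def compare_insertions(k7, s288c):
    """Split insertions into shared events and strain-unique events by flank."""
    k7_by_flank = {ins.flank_key: ins for ins in k7}
    s288c_by_flank = {ins.flank_key: ins for ins in s288c}
    shared_keys = sorted(k7_by_flank.keys() & s288c_by_flank.keys())
    shared = [(k7_by_flank[k], s288c_by_flank[k]) for k in shared_keys]
    k7_only = [ins for ins in k7 if ins.flank_key not in s288c_by_flank]
    s288c_only = [ins for ins in s288c if ins.flank_key not in k7_by_flank]
    return shared, k7_only, s288c_only


# Toy data only; names and flank keys are invented.
k7_toy = [
    Insertion("k7_ty_a", "flank1", True),
    Insertion("k7_ltr_b", "flank2", False),
    Insertion("k7_ltr_c", "flank3", False),
]
s288c_toy = [
    Insertion("s_ty_a", "flank1", True),
    Insertion("s_ty_b", "flank2", True),
    Insertion("s_ltr_d", "flank4", False),
]

shared, k7_only, s288c_only = compare_insertions(k7_toy, s288c_toy)

# Shared events in which both strains still carry a full-length element.
intact_in_both = [pair for pair in shared if pair[0].intact and pair[1].intact]
```

Applied to the counts reported in the text, the same bookkeeping gives
198 + 121 = 319 insertion events in K7 and 198 + 137 = 335 in S288C.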



4. Conclusions 

We determined the sequence and structure of the K7
genome. This represents the first such study of a sake
yeast lineage within S. cerevisiae and provides the
basis for future studies on the brewing characteristics,
genealogy and evolution of sake yeast. The phenotypic
effects of the identified structural polymorphisms
between the K7 and S288C genomes are largely
unknown and remain to be explored. In addition,
the uneven heterozygosity distribution found in the
K7 genome is suggestive of the microevolution of K7
and related sake yeast strains. Future genetic studies
encompassing a wide range of Saccharomyces strains
are necessary to resolve the genetic basis for the
characteristics and evolution of sake yeast strains in
greater detail.

The sequences and annotations reported in this paper
have been deposited at DDBJ/EMBL/GenBank under the
accession nos. BABQ01000001-BABQ01000705,
DG000037-DG000052 and AP012028. Sequence and
gene annotation information is also available
on the sake yeast genome database (http://nribf1.nrib.go.jp/SYGD/)
and the database of the genomes analyzed
at NITE (DOGAN; http://www.bio.nite.go.jp/dogan/top/).